Textual Information Extraction in Document Images Guided by a Concept Lattice
Abstract
Textual information extraction in images is concerned with extracting the relevant text data from a collection of document images. It consists of localizing (determining the location of) and recognizing (transforming into plain text) the text contained in document images. In this work we present a textual information extraction model consisting of a set of prototype regions along with pathways for browsing through these prototype regions. The proposed model is constructed in four steps: (1) produce synthetic invoice data containing the textual information of interest, along with their spatial positions; (2) partition the produced data; (3) derive the prototype regions from the obtained partition clusters; (4) build the concept lattice of a formal context derived from the prototype regions. Experimental results on a corpus of 1000 real-world scanned invoices show that the proposed model significantly improves the extraction rate of an Optical Character Recognition (OCR) engine.
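Step (4) of the abstract can be illustrated with a minimal sketch of building a concept lattice from a formal context. The abstract does not say how the authors derive their context or compute the lattice, so everything here is an assumption made for illustration: the region names, the field labels, the toy context itself, and the brute-force enumeration (practical only for very small contexts).

```python
from itertools import combinations

# Hypothetical formal context (not from the paper): objects are prototype
# regions, attributes are the invoice fields assumed to occur in each region.
CONTEXT = {
    "region_top_right": {"invoice_number", "invoice_date"},
    "region_top_left": {"invoice_number"},
    "region_bottom_right": {"total_amount", "invoice_date"},
    "region_bottom_left": {"total_amount"},
}


def common_attributes(context, objects):
    """Attributes shared by every object in `objects` (all attributes if empty)."""
    shared = set().union(*context.values())
    for obj in objects:
        shared &= context[obj]
    return frozenset(shared)


def objects_having(context, attributes):
    """Objects whose attribute set contains every attribute in `attributes`."""
    return frozenset(o for o, attrs in context.items() if attributes <= attrs)


def formal_concepts(context):
    """Enumerate all (extent, intent) pairs by closing every subset of objects."""
    concepts = set()
    objs = list(context)
    for size in range(len(objs) + 1):
        for subset in combinations(objs, size):
            intent = common_attributes(context, subset)
            extent = objects_having(context, intent)
            concepts.add((extent, intent))
    return concepts


def lattice_edges(concepts):
    """Cover relation: (lower, upper) pairs with no concept strictly between them."""
    edges = []
    for lower in concepts:
        for upper in concepts:
            if lower[0] < upper[0] and not any(
                lower[0] < mid[0] < upper[0] for mid in concepts
            ):
                edges.append((lower, upper))
    return edges


if __name__ == "__main__":
    concepts = formal_concepts(CONTEXT)
    for extent, intent in sorted(concepts, key=lambda c: len(c[0])):
        print(sorted(extent), "<->", sorted(intent))
    print(len(lattice_edges(concepts)), "cover edges")
```

In this toy setting, each concept groups the regions that share a set of fields, and the cover edges are one plausible reading of the "pathways" between prototype regions mentioned in the abstract; the paper may define them differently.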
Related resources
Automatic Annotation of Images, Pictures or Videos Comments for Text Mining Guided by No Textual Data
Text mining guided by No Textual data (TNT) is not intended to extract the information contained in images; it targets the information in the text that describes these images. In other words, it aims to present to the reader the information about the images next to them, regardless of its real position in the document. Reading, while focusing on no textual data (images, pictures, v...
Document Analysis And Classification Based On Passing Window
In this paper we present a document analysis and classification system to segment and classify the contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. In preprocessing, a document image is enhanced by removing noise, binarizing it, and detecting and correcting image skew. In document segmentation, an algorith...
A New Text Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual and documentary material increases day by day, so we need useful ways to store these materials and retrieve information from them. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the most similar ones to the user ...
Jump-starting Concept Map Construction with Knowledge Extracted from Documents
Online documents provide a rich information resource for aiding the generation of concept-map-based knowledge models, but analyzing resources to select concepts and links is a time-consuming task. This paper describes ongoing research on harnessing the information in unstructured textual documents, using information extraction algorithms, to generate a preliminary version of a concept map from ...
Conceptual Modeling with Formal Concept Analysis on Natural Language Texts
The paper presents a conceptual modelling technique for natural language texts. This technique combines two conceptual modelling paradigms: conceptual graphs and Formal Concept Analysis. Conceptual graphs serve as semantic models of text sentences and as the data source for the concept lattice, the basic conceptual model in Formal Concept Analysis. With the use of conceptual graphs the Text ...
Publication date: 2016